{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color='green'> RCV1 - Clustering (Sparse Matrix)<font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <font color='green'> 1. Description<font>\n",
    "    \n",
    "Clustering of documents using k-means. Dataset is downloaded by scikit-learn.\n",
    "\n",
    "This demo categorizes the document into culsters using kmeans.\n",
    "The dataset has manually assigned classes, which is used to check if the clustering is working appropriately.\n",
    "(there are multiple classes in the manually assigned classes, so the only the first one is used.) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <font color='green'> 2. Data Preprocessing<font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_rcv1\n",
    "from sklearn import metrics\n",
    "import numpy as np\n",
    "\n",
    "rcv1 = fetch_rcv1()\n",
    "X = rcv1.data\n",
    "target = rcv1.target\n",
    "\n",
    "# though there are multiple classes, only the first class is used\n",
    "idx = target.indices\n",
    "off = target.indptr\n",
    "offsize = off.shape[0]\n",
    "y = [idx[i] for i in off[0:offsize-1]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_clusters = 100\n",
    "max_iter = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <font color='green'> 3. Implementation using Frovedis<font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Frovedis train time: 2.994 sec\n",
      "Frovedis Homogeneity Score : 0.5026113149386877\n"
     ]
    }
   ],
   "source": [
    "# train\n",
    "import os, time\n",
    "from frovedis.mllib.cluster import KMeans as frovKMeans\n",
    "from frovedis.exrpc.server import FrovedisServer\n",
    "FrovedisServer.initialize(\"mpirun -np 8 {}\".format(os.environ['FROVEDIS_SERVER']))\n",
    "\n",
    "frov_km = frovKMeans(n_clusters=n_clusters, max_iter=max_iter, init='random', n_init=1)\n",
    "start = time.time()\n",
    "frov_y_pred = frov_km.fit_predict(X)\n",
    "end = time.time()\n",
    "print (\"Frovedis train time: {:.3f} sec\".format(end-start))\n",
    "frov_homogeneity_score = metrics.homogeneity_score(y, frov_y_pred)\n",
    "print('Frovedis Homogeneity Score : '+ str(frov_homogeneity_score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <font color='green'> 4. Implementation using scikit-learn<font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/takuya/venv/frovedis_python3/lib64/python3.6/site-packages/sklearn/cluster/_kmeans.py:939: FutureWarning: 'n_jobs' was deprecated in version 0.23 and will be removed in 0.25.\n",
      "  \" removed in 0.25.\", FutureWarning)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "scikit-learn train time: 26.446 sec\n",
      "scikit-learn Homogeneity Score : 0.5071450369250687\n"
     ]
    }
   ],
   "source": [
    "from sklearn.cluster import KMeans as skKMeans\n",
    "\n",
    "# algorithm=\"full\" is better than \"auto\" (elkan)\n",
    "sk_km = skKMeans(n_clusters=n_clusters, max_iter=max_iter, init='random',  algorithm=\"full\", n_init=1, n_jobs=12)\n",
    "start = time.time()\n",
    "sk_y_pred = sk_km.fit_predict(X)\n",
    "end = time.time()\n",
    "print (\"scikit-learn train time: {:.3f} sec\".format(end-start))\n",
    "sk_homogeneity_score = metrics.homogeneity_score(y, sk_y_pred)\n",
    "print('scikit-learn Homogeneity Score : '+ str(sk_homogeneity_score))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
